Water quality prediction plays an important role in environmental monitoring, ecological
sustainability, and aquaculture. Traditional prediction approaches do not capture the
nonlinearity and non-stationarity of water quality well. Owing to their rapid
progress, artificial neural networks (ANNs) have become a hotspot in water
quality prediction in recent years. In this study, ANNs are used to predict
water quality using soft computing techniques. A feedforward network was
trained with two back-propagation learning algorithms, Levenberg-Marquardt
and scaled conjugate gradient. One hidden layer was used for the modelling,
with the number of hidden neurons set at 3, 24, and 49. Six different testing
percentages were used, and the output is categorised as '0' for clean water
and '1' for polluted water. The results show that the most optimised model
was trainlm with a testing percentage of 18% and 3 neurons. This model
obtains an accuracy of 91.7% and a best validation performance of 0.073346
at 24 epochs, and its receiver operating characteristic (ROC) curve is closer
to the true positive rate than those of the other models.

1. INTRODUCTION

Fresh water consumption has increased in many parts of the world due to population growth and 
socioeconomic development. The world's population will require 64 billion cubic metres of freshwater per 
year by 2050, when it is estimated to reach 7.2 billion people and grow at a rate of 77 million people per year. 
Nonetheless, developing countries will be responsible for 90% of the projected three billion people by 2050, 
the majority of whom will live in water-scarce locations. Based on a 2% annual growth rate, domestic and 
industrial water demand in Malaysia alone is expected to rise by more than 20% in the next 50 years [1]. 
Malaysia is a fast-developing country on its way to achieving the 2020 objective. The development, on the 
other hand, has a severe environmental impact, particularly on water quality. Rapid urbanisation, which results
from the development of residential, commercial, and industrial sites, as well as infrastructure and other
facilities, is the main cause of river pollution [2]-[4]. Poor water quality management can be disastrous to
human civilization, as it can lead to disease outbreaks. Most countries have built water quality management
frameworks to ensure water quality because of the detrimental consequences to human health if water quality
is not properly managed. With the expansion of water quality management in 1985, the creation of water
quality criteria and benchmarks based on the water quality index began [5]. Poor water quality can also be a
concern because when a problem arises, resources must be redirected to improve water delivery infrastructure.

To address water quality issues, water quality modeling has been developed using current computing and
artificial intelligence (AI) techniques. Artificial neural networks (ANNs) have aided in the monitoring of water 
quality systems by detecting changes in water quality. Feed-forward neural networks, for example, have been 
employed in a variety of applications. ANN models require parameter values in order to design predictions. 
ANNs offer several advantages, including the ability to learn, manage highly complex nonlinear systems, and 
work in parallel [4]. By skipping complicated procedures and utilizing a step function as the activation 
function (p), which creates the output value, the single-layer neural network can be generated rapidly [6]. 

The chemical, physical, and biological qualities of water are referred to as "water quality". Thus, in 
order to define water quality, several physical, biological, and chemical parameter elements that have a 
significant impact on it must be recognized [7]. These parameters characterise a body of water and indicate its
suitability for a certain use, such as potability, ecosystem status, agriculture, industry, or recreation [8].
Having access to high-quality water is essential in our daily life. Water quality is important not only for 
drinking but also for agriculture, industry, human life, and the ecosystem [9]. Furthermore, water is the most 
crucial aspect in human well-being and economic progress. Individuals, as well as all living creatures, 
horticulture, and industrialization, require water [10]. The quantity and quality of water for sustaining 
livelihoods, human well-being, and socio-economic development cannot be secured without intentional efforts 
to solve water resources management challenges [11]. Water quality monitoring at the moment is primarily 
based on manual sampling detection and underwater sensor networking. In addition, a typical water quality 
test includes three main steps: water sampling, sample testing, and investigative analysis [12]. However,
manual sampling and detection has been shown to be ineffective since it cannot monitor dynamically at a
fixed time and fixed point, and it is costly in terms of personnel demand [13], [14].

In coastal locations, water quality indicators such as electrical conductivity (EC) and total dissolved
solids (TDS) are frequently used. TDS refers to the inorganic salts and small amounts of organic matter in
solution in water. The most common constituents are calcium, magnesium, sodium, and potassium cations, as
well as carbonate, hydrogen carbonate, chloride, sulphate, and nitrate anions [15], whereas EC refers to the
water's ability to carry electrical current. TDS and EC arise from both natural sources, such as geological
conditions and the ocean, and man-made sources, such as household waste, industrial waste, and agriculture.
EC is determined by dissolved ion concentrations, ionic strength, and temperature. Both EC and TDS also
serve as indicators of salt concentration. According to the United States Environmental Protection Agency
(US EPA), the maximum contaminant level of TDS is 500 parts per million (ppm), while according to the World
Health Organization (WHO), the maximum contaminant level is 1,000 ppm [16]. This indicates that the upper
limit for TDS lies between 500 and 1,000 mg/L for health reasons, and EC should not exceed 1,500 µS/cm [17].
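
The limits quoted above can be expressed as a simple check. The following is only an illustrative sketch: the constant names, the function name check_guidelines, and the example reading are assumptions, and the check is not the labelling rule used in this study, where class labels come from the water source (tap or drain).

```python
# Guideline limits as quoted in the text; all names here are illustrative.
US_EPA_TDS_LIMIT_PPM = 500.0
WHO_TDS_LIMIT_PPM = 1000.0
EC_LIMIT_US_PER_CM = 1500.0


def check_guidelines(tds_ppm, ec_us_per_cm):
    """Report which of the quoted guideline limits a reading exceeds."""
    return {
        "exceeds_us_epa_tds": tds_ppm > US_EPA_TDS_LIMIT_PPM,
        "exceeds_who_tds": tds_ppm > WHO_TDS_LIMIT_PPM,
        "exceeds_ec": ec_us_per_cm > EC_LIMIT_US_PER_CM,
    }


# Hypothetical reading: TDS of 750 ppm and EC of 1,200 uS/cm.
report = check_guidelines(750.0, 1200.0)
```

A reading can exceed the stricter US EPA limit while still falling below the WHO limit, which is why the two TDS checks are reported separately.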

ANNs have been the focus of many scientific domains, including ecology, analytical chemistry, and 
water quality. The ANN, also known as a neural network, was created to simulate the operation of a human 
brain. There are various types of ANN depending on the function used; this project used a feedforward
backpropagation network trained with the Levenberg-Marquardt (LM) and scaled conjugate gradient (SCG)
algorithms. The training algorithm used to calibrate the model parameters is very important for
the network to approximate complex non-linear input-output relationships. Based on previous studies, LM and
SCG were chosen as the training algorithms because they could achieve high accuracy in modelling
parameters. The accuracy achieved by previous researchers is 95.9% for LM and 90% for SCG. Another learning
algorithm that gives high accuracy is the support vector machine (SVM), at 87.10% [18]-[20]. Thus, the
objective of this research was to demonstrate that an ANN can achieve high accuracy in predicting the level
of water pollution. Based on the literature review, this research focused on developing a classifying system
that could identify the condition of tap and drain water using the LM and SCG learning algorithms. For the
analysis, the classification inference between measured data from clean and polluted water is
based on statistical analysis methods. The intensity of the network can be detected based on the group
behaviour of the connected neurons, and the output is determined by examining its input.

This network's main benefit is that it learns how to assess and detect input patterns [21], while the
hidden layer, or link between neurons, illustrates the system's complexity [22]. In engineering, neural
networks serve two important functions: as pattern classifiers and as nonlinear adaptive filters
[23]. Based on the research done by E. Salami, each node is connected to input arcs from other hidden nodes
or input nodes, and this arrangement is adopted in this research. In an ANN, the system model is developed
in the hidden layer via a system of weighted 'connections'. At the end of the process, the output
layer represents the network results [24], [25].

2. METHOD

In this paper, a statistical analysis-based approach using reliable total dissolved solids (TDS) and
electrical conductivity (EC) measurements of water was proposed to develop a classifying system that could
identify the condition of drain and tap water by using the Levenberg-Marquardt algorithm and the scaled
conjugate gradient algorithm. Figure 1 shows one of the neural networks used in this project, which consists
of two inputs, a single hidden layer that contains three neurons, and one output. The data for this project
were collected using a TDS sensor to measure TDS and an EC sensor to measure the electrical conductivity of
the water. The block diagram used in this research to evaluate the samples based on these parameters is
shown in Figure 2. The flowchart of the entire system process is shown in Figure 3, and each part of the
flowchart is discussed in detail throughout this section.

Figure 1. ANN structure used for this project
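
The structure described above (two inputs, one hidden layer of three neurons, one output) can be sketched as a forward pass. This is a minimal illustration and not the trained network from this study: the tanh hidden activation, the sigmoid output, the 0.5 decision threshold, the function names, and every weight value are assumptions chosen for the example.

```python
import math


def sigmoid(x):
    """Logistic function, used here as an assumed output activation."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a 2-3-1 feedforward network; returns a score in (0, 1)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def classify(score, threshold=0.5):
    """Map the score to '1' (polluted) or '0' (clean); the threshold is assumed."""
    return 1 if score >= threshold else 0


# Hypothetical, untrained weights; inputs are (TDS, EC) scaled to the range [0, 1].
hidden_weights = [[0.8, -0.4], [0.3, 0.9], [-0.5, 0.6]]
hidden_biases = [0.1, -0.2, 0.05]
output_weights = [1.2, -0.7, 0.9]
output_bias = -0.3

score = forward([0.35, 0.62], hidden_weights, hidden_biases, output_weights, output_bias)
label = classify(score)
```

Training with trainlm or trainscg amounts to adjusting these weights and biases so that the scores match the known labels; the forward pass itself stays the same.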

2.1. Data collection and sample preparation

A total of 200 samples were collected from the Sungai Besar River in Sabak Bernam, Selangor,
Malaysia. This river was selected because of its high water consumption for agricultural
activities such as palm oil and rubber tree plantations and paddy fields. At the beginning of the data
collection process, 100 samples of tap water were taken as clean water, whereas 100 samples of drain water
from the river were taken as polluted water. The water samples were collected from December to January 2021.
The water was collected directly from the drain using a water quality measurement device. The water was
stored in a container that was guaranteed to be contaminant-free. To perform the test, a minimum of 50 mL
of water was collected from the tap and drain. The sample containers were placed in a box that was kept in
a wet location to control the water temperature.


2.2. Measurement of TDS and EC 

The formula in (1) was used to calculate an approximate TDS value from the EC value: the EC reading from
the experiment in millisiemens per centimetre (mS/cm) is multiplied by 1000 and divided by two. Conversely,
to obtain the EC value, the TDS value in parts per million (ppm) is multiplied by two and divided by 1000,
as in (2).


TDS (ppm) = EC (mS/cm) × 1000 / 2    (1)
EC (mS/cm) = TDS (ppm) × 2 / 1000    (2)
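
Equations (1) and (2) translate directly into two small functions. The sketch below is illustrative; the function names, the constant name, and the example reading are assumptions.

```python
TDS_FACTOR = 2.0  # the factor of two that appears in equations (1) and (2)


def ec_to_tds(ec_ms_per_cm):
    """Equation (1): approximate TDS in ppm from EC in mS/cm."""
    return ec_ms_per_cm * 1000.0 / TDS_FACTOR


def tds_to_ec(tds_ppm):
    """Equation (2): approximate EC in mS/cm from TDS in ppm."""
    return tds_ppm * TDS_FACTOR / 1000.0


# Hypothetical reading: an EC of 1.5 mS/cm corresponds to about 750 ppm TDS.
tds = ec_to_tds(1.5)
ec = tds_to_ec(tds)
```

The two functions are inverses of each other, so converting a reading in one direction and back returns the original value.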


2.3. Training, testing and validating using trainlm and trainscg 

The terms "trainlm" (Levenberg-Marquardt) and "trainscg" (scaled conjugate gradient) refer to the two
training functions that were used to analyse and evaluate the performance of the ANN. The models were
trained and evaluated using the 200 samples, with the testing percentage varied from 15% to 20%. The number
of neurons was increased until 49 in a step size of 2, with only the initial, middle and last number of
neurons being considered, so the tests were carried out for 3, 24, and 49 neurons. Each model was named as
"learning algorithm_testing percentage_number of neurons", for example "trainlm_16_3". The percentages
employed in each model are shown in Table 1. The performance of the network was evaluated using the
confusion matrix parameters, which consisted of specificity, sensitivity, and accuracy.


Table 1. The training, testing, and validating percentage
Total no. of samples: 200

                                 Ratio of samples
Training  Testing  Validating    Training  Testing  Validating
70%       15%      15%           1:140     141:170  171:200
70%       16%      14%           1:140     141:172  173:200
70%       17%      13%           1:140     141:174  175:200
70%       18%      12%           1:140     141:176  177:200
70%       19%      11%           1:140     141:178  179:200
70%       20%      10%           1:140     141:180  181:200
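
The sample ranges in Table 1 follow from the percentages by simple arithmetic. The sketch below reproduces them, assuming contiguous 1-based index blocks in the order training, testing, validating; the function name split_ranges is hypothetical.

```python
def split_ranges(total, train_pct, test_pct, val_pct):
    """Return 1-based inclusive (start, end) ranges for training, testing, validating."""
    if train_pct + test_pct + val_pct != 100:
        raise ValueError("percentages must sum to 100")
    train_end = total * train_pct // 100
    test_end = train_end + total * test_pct // 100
    # The validating block takes whatever remains, so the three blocks cover every sample.
    return (1, train_end), (train_end + 1, test_end), (test_end + 1, total)


# The six splits of Table 1: 70% training with 15% to 20% testing.
table_1 = {test_pct: split_ranges(200, 70, test_pct, 30 - test_pct)
           for test_pct in range(15, 21)}
```

For the 18% testing split this gives samples 1 to 140 for training, 141 to 176 for testing, and 177 to 200 for validating, which matches the corresponding row of Table 1.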


2.4. Performance evaluation method using ANN 

Following the testing of the samples, MATLAB produced four figures: neural network training
(nntraintool), performance plot, receiver operating characteristic (ROC), and confusion matrix.
The number of neurons used to test the sample is displayed in the hidden layer portion.
Examples of the figures generated from the ANN are illustrated in Figure 4, where Figure 4(a) shows the
network architecture based on a hidden layer with 49 neurons, and the output is either '1' for polluted or
'0' for clean water. Meanwhile, Figure 4(b) illustrates the performance plot for the training, validation,
and test performance of the training record.

In general, as the number of training epochs increases, the error reduces, but it may gradually climb
on the validation data set as the network begins to overfit the training data. In an ANN, an epoch is one
complete pass through the training dataset. It normally takes several epochs to train a neural network. It
is well known that an epoch can cause an iteration to fail [26]. A confusion matrix is a machine learning
concept that stores data about a classification system's actual and predicted classifications. A confusion
matrix has two dimensions, one for the actual class of the item and the other for the class predicted by
the classifier [27]. The key part of analysing the neural network's performance is the test confusion
matrix, from which the specificity, sensitivity, and accuracy values are obtained and compared.

Figure 4. Examples of figures generated from the ANN: (a) schematic diagram of the ANN architecture
and (b) performance plot of the mean square error
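
The relationship between epochs and validation error described above can be sketched as a simple stopping rule. This is a simplified illustration and not necessarily the rule applied by the toolbox used in this study: the function name best_validation, the patience parameter max_fail and its values, and the error values in the example are all assumptions.

```python
def best_validation(val_mse, max_fail=6):
    """Return (best_epoch, best_mse, stop_epoch) for a list of per-epoch validation MSE.

    Training is treated as stopped once max_fail consecutive epochs pass without
    the validation error improving on its best value so far.
    """
    if not val_mse:
        raise ValueError("val_mse must contain at least one epoch")
    best_epoch, best_mse, fails = 0, float("inf"), 0
    for epoch, mse in enumerate(val_mse):
        if mse < best_mse:
            best_epoch, best_mse, fails = epoch, mse, 0
        else:
            fails += 1
            if fails >= max_fail:
                return best_epoch, best_mse, epoch
    return best_epoch, best_mse, len(val_mse) - 1


# Hypothetical curve: the error falls for three epochs, then rises as overfitting begins.
curve = [0.20, 0.12, 0.09, 0.10, 0.11, 0.13]
result = best_validation(curve, max_fail=3)
```

In this example the lowest validation error occurs at epoch 2 (counting from 0) and training stops at epoch 5, so the weights from epoch 2 would be the ones to keep.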


There are four measurements in the confusion matrix, as shown in Figure 5. True positive (TP): both the
standard and the predicted outcome are positive. True negative (TN): both the standard and the predicted
outcome are negative. False positive (FP): the standard is negative, but the predicted outcome is positive.
False negative (FN): the standard is positive, while the predicted outcome is negative. A variety of
overall performance measures were derived from the confusion matrix.

Figure 5. Confusion matrix for training, testing, validation, and overall
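
The three performance measures used in this study follow from the four confusion matrix counts by their standard definitions. The sketch below treats '1' (polluted) as the positive class, which is an assumption; the function names and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, FN for two equal-length sequences of class labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def ratio(numerator, denominator):
    """Divide safely; an empty class gives NaN instead of raising an error."""
    return numerator / denominator if denominator else float("nan")


def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy from the confusion matrix counts."""
    return {
        "sensitivity": ratio(tp, tp + fn),  # share of polluted samples detected
        "specificity": ratio(tn, tn + fp),  # share of clean samples recognised
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Hypothetical labels for six test samples.
actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
```

Sensitivity and specificity are reported alongside accuracy because a classifier can reach high accuracy while still missing most samples of one class.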


3. RESULTS AND DISCUSSION 

In this research, the mean square error (MSE), epoch, specificity, sensitivity, and accuracy for each
testing percentage and number of neurons are evaluated and discussed. Table 2 shows the values obtained
from the mean square error versus epoch validation graph for trainlm. The best validation performance is
the one where both the validation MSE and the number of epochs are low. Thus, for trainlm, the best
validation performance occurs at a testing percentage of 19%, with an MSE of 0.058764 at 1 epoch.
Meanwhile, for trainscg in Table 3, the best validation performance occurs at a testing percentage of 20%,
with an MSE of 0.048046 at 1 epoch.

Based on Table 4, three models gave the best accuracy compared to the other models.
Firstly, even though the model with a testing percentage of 16% and 49 neurons obtained the highest accuracy
in the confusion matrix, it was not selected as the best model. This is because the high number of neurons
would increase the complexity of the system compared to the other two models, which had a lower
number of neurons. So, in order to compare the models, other parameters should be considered as well. The
parameters were testing percentage, number of neurons, accuracy, validation performance, epoch, and ROC.
Table 4 shows that the trainlm model with an 18% testing percentage, an accuracy of 91.7%, a validation
performance of 0.073346 at 24 epochs, and only 3 neurons was the most optimised model compared to the
others. In addition, the ROC curve of this model shows a closer approach to the true positive rate than
those of the other models. This shows that the LM learning algorithm can achieve high accuracy in modelling
parameters, as can be seen from the consistent accuracy values shown in Table 5.

Indonesian J Elec Eng & Comp Sci, Vol. 26, No. 3, June 2022: 1684-1691, ISSN: 2502-4752


Table 2. Best validation performance and epoch for mean square error in trainlm

     Testing %  Neuron  Best validation performance  Epoch
(a)  15         49      0.070041                     44
(b)  16         49      0.068756                     68
(c)  17         3       0.081770                     2
(d)  18         3       0.073346                     24
(e)  19         3       0.058764                     1
(f)  20         3       0.062318                     3


Table 3. Best validation performance and epoch for mean square error in trainscg

     Testing %  Neuron  Best validation performance  Epoch
(a)  15         3       0.074121                     14
(b)  16         3       0.071393                     2
(c)  17         3       0.081021                     2
(d)  18         3       0.086117                     6
(e)  19         3       0.059714                     3
(f)  20         3       0.048046                     1


Table 4. Comparison between models

Model     Trainlm_16_49  Trainlm_18_3  Trainscg_18_3
Testing   16%            18%           18%
Neuron    49             3             3
Accuracy  93.8%          91.7%         88.9%
MSE       0.068756       0.073346      0.086117
Epoch     68             24            6


Table 5. Comparison with models in previous studies

Algorithm                  Accuracy  Source
Levenberg-Marquardt        95.9%     Previous studies [18]-[20]
Scaled conjugate gradient  90%       Previous studies [18]-[20]
Support vector machine     87.10%    Previous studies [18]-[20]
Trainlm_16_49              93.8%     This study
Trainlm_18_3               91.7%     This study
Trainscg_18_3              88.9%     This study


4. CONCLUSION 

This research was undertaken to predict river water quality using the soft computing techniques of
an ANN trained with two back-propagation learning algorithms, Levenberg-Marquardt and scaled conjugate
gradient. In this paper, the two learning algorithms were compared in order to classify the quality of
water in the Sabak Bernam river most effectively. Both models used one hidden layer for modelling, and the
number of hidden neurons was set at 3, 24, and 49. In addition, six different testing percentages were used
for this analysis: 15%, 16%, 17%, 18%, 19%, and 20%. The results show that the best model was trainlm at a
testing percentage of 18% with 3 neurons and an accuracy of 91.7%. The best validation performance of this
model was 0.073346 at 24 epochs, and its ROC curve was closer to the true positive rate than those of the
other models. It is concluded that the main objective of this work, which was to find the most optimised
model based on the Levenberg-Marquardt and scaled conjugate gradient learning algorithms, was successfully
achieved. This system can be expanded in the future by integrating it with internet of things (IoT)
capabilities, making it fully automated, and extending the sensing activity to other sensors. In such
extensions, the classification process will remain the main essence of this research, which aims to find
the most optimised model with the highest performance at the highest achievable accuracy.